AI Security

AIディフェンス研究所

https://jpsec.ai/blog/

Security camp感想

https://ryoryon66.hatenablog.com/entry/2022/10/03/103859

AIJack

https://github.com/Koukyosyumei/AIJack

PySyft

https://github.com/OpenMined/PySyft

Generative AI and Large Language Models for Cyber Security: All Insights You Need

https://arxiv.org/abs/2405.12750

Security of LLM Information Hub

https://tasuku-sasaki-lab.github.io/Tasuku-Sasaki.github.io/LLM-Security/

TrustLLM: Trustworthiness in Large Language Models

https://arxiv.org/abs/2401.05561

Breaking Down the Defenses: A Comparative Survey of Attacks on Large Language Models

https://arxiv.org/abs/2403.04786v2

A Survey on Large Language Model (LLM) Security and Privacy: The Good, the Bad, and the Ugly

https://arxiv.org/abs/2312.02003

Golden Gate Claude

https://www.anthropic.com/news/golden-gate-claude

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html#/

Improving Alignment and Robustness with Circuit Breakers

https://arxiv.org/abs/2406.04313v2

OpenAI APIで、問題のある発言を検出するmoderationモデルを試してみた

https://dev.classmethod.jp/articles/openai-api-moderation-model/

ChatGPT "DAN" (and other "Jailbreaks")

https://github.com/0xk1h0/ChatGPT_DAN

Universal and Transferable Adversarial Attacks on Aligned Language Models

https://llm-attacks.org/

NeMo-Guardrails

https://github.com/NVIDIA/NeMo-Guardrails?tab=readme-ov-file#nemo-guardrails

LLMにおけるガードレールについて

https://zenn.dev/ayumuakagi/articles/llm_guardrails

【2024.9.9 AIアライメントネットワーク設立記念シンポジウム】#1「ALIGNの挑戦」髙橋恒一（ALIGN代表理事）

https://www.youtube.com/watch?v=_13ORbYifbU&t=910s

Singular Learning TheoryとAI Alignmentが結びつくのが面白い。そろそろ寝なければと思っていたのに、眠れなくなった。アラインメントとFree Energyが結びつき、脳のことにまで発想が広がってきた。

Hacking Back the AI-Hacker: Prompt Injection as a Defense Against LLM-driven Cyberattacks

https://arxiv.org/abs/2410.20911

多文化・他言語対応の安全な大規模言語モデルの構築を目指して

https://www.youtube.com/watch?v=NLaayZ4v6Ag

LLMのアウトプットをバリデーションする関数が集まるGuardrails Hubを試す

https://zenn.dev/gaudiy_blog/articles/9de43ed4b260ce

BlackDAN: A Black-Box Multi-Objective Approach for Effective and Contextual Jailbreaking of Large Language Models

https://arxiv.org/abs/2410.09804

SLM as Guardian: Pioneering AI Safety with Small Language Models

https://arxiv.org/abs/2405.19795

LLM Agent Honeypot: Monitoring AI Hacking Agents in the Wild

https://arxiv.org/abs/2410.13919

Hacker Panel: What Hackers Can Tell You About AI Security

https://www.youtube.com/watch?v=eoXouUA1raQ

LLMjackingがDeepSeekを標的にする

https://sysdig.jp/blog/llmjacking-targets-deepseek/

OCCULT: Evaluating Large Language Models for Offensive Cyber Operation Capabilities

https://arxiv.org/abs/2502.15797

AISafetyLab: A Comprehensive Framework for AI Safety Evaluation and Improvement

https://arxiv.org/abs/2502.16776

AIセーフティに関するレッドチーミング手法ガイド Guide to Red Teaming Methodology on AI Safety

https://aisi.go.jp/effort/effort_framework/guide_to_red_teaming_methodology_on_ai_safety/